8 research outputs found
Doctor of Philosophy
dissertation
With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from data by discovering the patterns and structures present in it. In this dissertation, we focus largely on clustering, which is often the first step in any exploratory data mining task: items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Because clustering is unsupervised, i.e., there are no domain experts who can label the data, validating the results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges, we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well-founded, spatially aware metric for comparing clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction.
We present a geometric approach that defines regions of influence for data items and clusters, and we use it to develop adaptive sampling techniques to speed up machine learning algorithms. This dissertation is therefore a systematic approach to studying the landscape of clusterings in an attempt to provide a better understanding of the data.
Power to the points: validating data memberships in clusterings
pre-print
In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. We demonstrate that these scores accurately capture the quality of the label assigned to the point. We also show further applications of these scores to estimate global measures of clustering quality, as well as accelerate clustering algorithms by orders of magnitude using active selection based on affinity. This method is very general and applies to clusterings derived from any geometric source. It lends itself to easy visualization and can prove useful as part of an interactive visual analytics framework. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent both of the size and dimensionality of the data. It is based on techniques from the theory of interpolation, coupled with sampling and estimation algorithms from high dimensional computational geometry.
Spatially-Aware Comparison and Consensus for Clusterings
This paper proposes a new distance metric between clusterings that
incorporates information about the spatial distribution of points and clusters.
Our approach builds on the idea of a Hilbert space-based representation of
clusters as a combination of the representations of their constituent points.
We use this representation and the underlying metric to design a
spatially-aware consensus clustering procedure. This consensus procedure is
implemented via a novel reduction to Euclidean clustering, and is both simple
and efficient. All of our results apply to both soft and hard clusterings. We
accompany these algorithms with a detailed experimental evaluation that
demonstrates the efficiency and quality of our techniques.
Comment: 12 pages, 9 figures, Proceedings of the 2011 SIAM International
Conference on Data Mining
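The idea of representing a cluster as a combination of the Hilbert-space representations of its points can be illustrated with a small sketch. The abstract does not specify the kernel or the exact combination used, so the Gaussian kernel, the uniform averaging of points, and the names gaussian_kernel and cluster_distance below are assumptions made purely for illustration; this is not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix of a Gaussian kernel between rows of X and rows of Y.

    The kernel choice and bandwidth are assumptions for this sketch.
    """
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def cluster_distance(A, B, sigma=1.0):
    """Distance between two clusters, each represented as the uniform
    average of the kernel feature maps of its points.

    Expanding ||mean(phi(A)) - mean(phi(B))||^2 gives three Gram-matrix
    averages, so the feature map never has to be computed explicitly.
    """
    d2 = (gaussian_kernel(A, A, sigma).mean()
          + gaussian_kernel(B, B, sigma).mean()
          - 2.0 * gaussian_kernel(A, B, sigma).mean())
    return float(np.sqrt(max(d2, 0.0)))
```

Under these assumptions, two clusters that occupy the same region of space are close even if they share no points, which is the sense in which such a distance is spatially aware.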
A Geometric Algorithm for Scalable Multiple Kernel Learning
We present a geometric formulation of the Multiple Kernel Learning (MKL)
problem. To do so, we reinterpret the problem of learning kernel weights as
searching for a kernel that maximizes the minimum (kernel) distance between two
convex polytopes. This interpretation combined with novel structural insights
from our geometric formulation allows us to reduce the MKL problem to a simple
optimization routine that yields provable convergence as well as quality
guarantees. As a result our method scales efficiently to much larger data sets
than most prior methods can handle. Empirical evaluation on eleven datasets
shows that we are significantly faster and even compare favorably with a
uniform unweighted combination of kernels.
Comment: 20 pages
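As a point of reference for the MKL abstract above, the object being learned is a nonnegative weighting of base kernels. The sketch below only builds the combined Gram matrix for given weights, with the uniform combination as the default baseline; it does not implement the geometric optimization the paper describes, and the function name combine_kernels is hypothetical.

```python
import numpy as np

def combine_kernels(grams, weights=None):
    """Weighted sum of base Gram matrices.

    grams:   sequence of m Gram matrices, each of shape (n, n)
    weights: m nonnegative kernel weights; defaults to the uniform
             combination 1/m, the simple baseline MKL is compared against
    """
    grams = np.asarray(grams, dtype=float)        # shape (m, n, n)
    m = grams.shape[0]
    if weights is None:
        weights = np.full(m, 1.0 / m)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (m,):
        raise ValueError("need exactly one weight per base kernel")
    if np.any(weights < 0):
        raise ValueError("kernel weights must be nonnegative")
    # Contract the weight vector against the first axis of the stack.
    return np.tensordot(weights, grams, axes=1)
```

Nonnegative weights matter here: a nonnegative combination of positive semidefinite Gram matrices is itself positive semidefinite, so the result is still a valid kernel.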
Power to the Points: Validating Data Memberships in Clusterings
A clustering is an implicit assignment of labels to points, based on
proximity to other points. It is these labels that are then used for downstream
analysis (either focusing on individual clusters, or identifying
representatives of clusters and so on). Thus, in order to trust a clustering as
a first step in exploratory data analysis, we must trust the labels assigned to
individual data. Without supervision, how can we validate this assignment? In
this paper, we present a method to attach affinity scores to the implicit
labels of individual points in a clustering. The affinity scores capture the
confidence level of the cluster that claims to "own" the point. This method is
very general: it can be used with clusterings derived from Euclidean data,
kernelized data, or even data derived from information spaces. It smoothly
incorporates importance functions on clusters, allowing us to weight different
clusters differently. It is also efficient: assigning an affinity score to a
point depends only polynomially on the number of clusters and is independent of
the number of points in the data. The dimensionality of the underlying space
only appears in preprocessing. We demonstrate the value of our approach with an
experimental study that illustrates the use of these scores in different data
analysis tasks, as well as the efficiency and flexibility of the method. We
also demonstrate useful visualizations of these scores; these might prove
useful within an interactive analytics framework.
Comment: 18 pages, 9 figures, 5 tables
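To make the notion of a per-point affinity score concrete, here is a deliberately simple stand-in. It is not the interpolation-based method of the paper: it assumes each cluster is summarized by a single center, and it scores a point with a softmax over negative squared distances to those centers, optionally scaled by positive per-cluster importance weights. The function name standin_affinity and the scale parameter are hypothetical.

```python
import numpy as np

def standin_affinity(point, centers, weights=None, scale=1.0):
    """Stand-in affinity of one point to each of k clusters.

    NOT the paper's method: a softmax over negative squared distances
    to cluster centers, used only to illustrate a per-point score.

    point:   array of shape (d,)
    centers: array of shape (k, d), one assumed center per cluster
    weights: optional positive importance weight per cluster
    scale:   assumed softness parameter; larger means softer scores
    """
    centers = np.asarray(centers, dtype=float)
    point = np.asarray(point, dtype=float)
    k = centers.shape[0]
    w = np.ones(k) if weights is None else np.asarray(weights, dtype=float)
    if np.any(w <= 0):
        raise ValueError("importance weights must be positive")
    sq = np.sum((centers - point) ** 2, axis=1)
    logits = -sq / scale + np.log(w)
    logits -= logits.max()              # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()                  # nonnegative scores summing to 1
```

Like the scores described in the abstract, this stand-in costs time that depends on the number of clusters rather than the number of data points, and a point near a boundary between clusters receives a visibly lower score for its owning cluster than a point deep inside it.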